{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 5.2 Speech Recognition\n",
    "\n",
    "Speech recognition, also known as automatic speech recognition (ASR), is a technology that converts spoken words into written format or executes specific actions based on verbal commands. It involves machine learning models that analyze speech patterns, phonetics, and language structures to accurately transcribe and understand human speech.\n",
    "\n",
    "[Whisper](https://openai.com/research/whisper), published by OpenAI, is a popular open-source model for both ASR and speech translation. This means that Whisper has the capability to transcribe speech in multiple languages and facilitate translation from those languages into English.\n",
    "\n",
    "Due to its underlying Transformer-based encoder-decoder architecture, Whisper can be optimized effectively with BigDL-LLM INT4 optimizations. In this tutorial, we will guide you through building a speech recognition application on BigDL-LLM optimized Whisper model that can transcribe/translate audio files into text.\n",
    "\n",
    "## 5.2.1 Install Packages\n",
    "\n",
    "Follow instructions in [Chapter 2](../ch_2_Environment_Setup/README.md) to setup your environment if you haven't done so. Then install bigdl-llm:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --pre --upgrade bigdl-llm[all]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Due to the requirement to process audio file, you will also need to install the `librosa` package for audio analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U librosa"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.2 Download Audio Files\n",
    "\n",
    "To begin, let's prepare some audio files. As an example, you can download [an English example](https://datasets-server.huggingface.co/assets/patrickvonplaten/librispeech_asr_dummy/--/clean/validation/2/audio/audio.wav) from English audio dataset [librispeech_asr_dummy](https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_dummy) and [one Chinese example](https://datasets-server.huggingface.co/assets/carlot/AIShell/--/692ef58020d79b21f54eb25b15a4813d4f9650d7/--/default/train/84/audio/audio.wav) from the Chinese audio dataset [AIShell](https://huggingface.co/datasets/carlot/AIShell). Here, the English audio file and the Chinese audio file have been randomly selected. Feel free to choose different audio files according to your preferences."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we rename the files to `audio_en.wav` and `audio_zh.wav` and put them in the current path.You could play the successfully-downloaded audio:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "\n",
    "IPython.display.display(IPython.display.Audio(\"audio_en.wav\"))\n",
    "IPython.display.display(IPython.display.Audio(\"audio_zh.wav\"))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.3 Load Pretrained Whisper Model\n",
    "\n",
    "Now, let's load a pretrained Whisper model, e.g. [whisper-medium](https://huggingface.co/openai/whisper-medium) as an example. OpenAI has released pretrained Whisper models in various sizes (including [whisper-small](https://huggingface.co/openai/whisper-small), [whisper-tiny](https://huggingface.co/openai/whisper-tiny), etc.), allowing you to choose the one that best fits your requirements. \n",
    "\n",
    "Simply use one-line `transformers`-style API in `bigdl-llm` to load `whisper-medium` with INT4 optimizations (by specifying `load_in_4bit=True`) as follows. Please note that model class `AutoModelForSpeechSeq2Seq` is used for Whisper:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq\n",
    "\n",
    "model = AutoModelForSpeechSeq2Seq.from_pretrained(pretrained_model_name_or_path=\"openai/whisper-medium\",\n",
    "                                                  load_in_4bit=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.4 Load Whisper Processor\n",
    "\n",
    "A Whisper processor is also needed for both audio pre-processing, and post-processing model outputs from tokens to texts. Just use the official `transformers` API to load `WhisperProcessor`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from transformers import WhisperProcessor\n",
    "\n",
    "processor = WhisperProcessor.from_pretrained(pretrained_model_name_or_path=\"openai/whisper-medium\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.5 Transcribe English Audio\n",
    "\n",
    "Once you have optimized the Whisper model using BigDL-LLM with INT4 optimization and loaded the Whisper processor, you are ready to begin transcribing the audio through model inference.\n",
    "\n",
    "Let's start with the English audio file `audio_en.wav`. Before we feed it into Whisper processor, we need to extract sequence data from raw speech waveform:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import librosa\n",
    "\n",
    "data_en, sample_rate_en = librosa.load(\"audio_en.wav\", sr=16000)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> For `whisper-medium`, its `WhisperFeatureExtractor` (part of `WhisperProcessor`) extracts features from audio using a 16,000Hz sampling rate by default. It's important to load the audio file at the sample sampling rate with model's `WhisperFeatureExtractor` for precise recognition.\n",
    "\n",
    "We can then proceed to transcribe the audio file based on the sequence data, using exactly the same way as using official `transformers` API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------- English Transcription --------------------\n",
      "[' He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind.']\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import time\n",
    "\n",
    "# define task type\n",
    "forced_decoder_ids = processor.get_decoder_prompt_ids(language=\"english\", task=\"transcribe\")\n",
    "\n",
    "with torch.inference_mode():\n",
    "    # extract input features for the Whisper model\n",
    "    input_features = processor(data_en, sampling_rate=sample_rate_en, return_tensors=\"pt\").input_features\n",
    "\n",
    "    # predict token ids for transcription\n",
    "    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids,max_new_tokens=200)\n",
    "\n",
    "    # decode token ids into texts\n",
    "    transcribe_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n",
    "\n",
    "    print('-'*20, 'English Transcription', '-'*20)\n",
    "    print(transcribe_str)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> `forced_decoder_ids` defines the context token for different language and task (transcribe or translate). If it is set to `None`, Whisper will automatically predict them.\n",
    "\n",
    "\n",
    "## 5.2.6 Transcribe Chinese Audio and Translate to English\n",
    "\n",
    "Then let's move to the Chinese audio `audio_zh.wav`. Whisper can transcribe multilingual audio, and translate them into English. The only difference here is to define specific context token through `forced_decoder_ids`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------- Chinese Transcription --------------------\n",
      "['这样能相对保障产品的质量']\n",
      "-------------------- Chinese to English Translation --------------------\n",
      "[' This can ensure the quality of the product relatively.']\n"
     ]
    }
   ],
   "source": [
    "# extract sequence data\n",
    "data_zh, sample_rate_zh = librosa.load(\"audio_zh.wav\", sr=16000)\n",
    "\n",
    "# define Chinese transcribe task\n",
    "forced_decoder_ids = processor.get_decoder_prompt_ids(language=\"chinese\", task=\"transcribe\")\n",
    "\n",
    "with torch.inference_mode():\n",
    "    input_features = processor(data_zh, sampling_rate=sample_rate_zh, return_tensors=\"pt\").input_features\n",
    "    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)\n",
    "    transcribe_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n",
    "\n",
    "    print('-'*20, 'Chinese Transcription', '-'*20)\n",
    "    print(transcribe_str)\n",
    "\n",
    "# define Chinese transcribe and translation task\n",
    "forced_decoder_ids = processor.get_decoder_prompt_ids(language=\"chinese\", task=\"translate\")\n",
    "\n",
    "with torch.inference_mode():\n",
    "    input_features = processor(data_zh, sampling_rate=sample_rate_zh, return_tensors=\"pt\").input_features\n",
    "    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, max_new_tokens=200)\n",
    "    translate_str = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n",
    "\n",
    "    print('-'*20, 'Chinese to English Translation', '-'*20)\n",
    "    print(translate_str)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2.7 What's Next?\n",
    "\n",
    "In the upcoming chapter, we will explore the usage of BigDL-LLM in conjunction with langchain, a framework designed for developing applications with language models. With langchain integration, application development process could be simplified."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "bigdl-llm-tutorial",
   "language": "python",
   "name": "bigdl-llm-tutorial"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
